Abstract



Stein unbiased risk estimation is generalized twice, from the Gaussian shift model 
to nonparametric families of smooth densities, and from the quadratic risk to more 
general divergence type distances. The development relies on a connection with local 
proper scoring rules. 

1 Introduction: SURE and the Hyvärinen score

Consider the problem of estimating the parameter $\theta$ in the standard Gaussian shift family
$P_\theta = \mathcal{N}(\theta, I_d)$, $\theta \in \mathbb{R}^d$, based on an observation $x \in \mathbb{R}^d$. Let $T$ be an estimator of $\theta$ of the
form $T = x + g(x)$. Using partial integration, Stein showed that under weak conditions
on $g$ the quadratic risk $R(T, \theta) = E_\theta |T - \theta|^2$ of $T$ can be estimated unbiasedly by the
expression $\hat R(T) = 2\,\mathrm{div}\, g(x) + |g(x)|^2 + d$, called SURE (Stein unbiased risk estimate), so
that $E_\theta \hat R(T) = R(T, \theta)$ for every $\theta$. Here $|\cdot|$ and $\langle \cdot, \cdot \rangle$ denote the Euclidean norm
and inner product on $\mathbb{R}^d$, respectively, and $\mathrm{div}\, g$ is the divergence of $g$. If in particular
$g = \nabla \log f$ for some function $f > 0$ on $\mathbb{R}^d$, the risk estimate becomes

$$\hat R(T) = 2\,\Delta \log f(x) + |\nabla \log f(x)|^2 + d, \qquad (1)$$

where as usual, $\nabla$ denotes the gradient and $\Delta = \mathrm{div}\,\nabla$ the Laplace operator on $\mathbb{R}^d$. This
special case occurs if $T$ is the posterior mean with respect to a prior distribution $\pi$: then
$T = x + \nabla \log f(x)$, where $f(x) = \int p_\theta(x)\, d\pi(\theta)$ is the corresponding mixture density, so
that $g = \nabla \log f$.
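The unbiasedness of the risk estimate can be checked numerically in the one case where everything is available in closed form. The following is a minimal sketch, assuming a normal prior $\pi = \mathcal{N}(0, \tau^2 I_d)$ (an illustrative choice, not taken from the text), so that $f$ is the $\mathcal{N}(0, (1+\tau^2) I_d)$ density and $\nabla \log f(x) = -x/(1+\tau^2)$. The function names `sure_linear_shrinkage` and `true_risk` are hypothetical.

```python
import numpy as np

def sure_linear_shrinkage(x, tau2):
    """Risk estimate 2*Laplacian(log f) + |grad log f|^2 + d for the
    N(0, (1 + tau2) I) mixture density f (assumed normal prior)."""
    d = x.shape[-1]
    c = 1.0 / (1.0 + tau2)                    # grad log f(x) = -c * x
    laplacian_log_f = -d * c                  # constant for a Gaussian f
    grad_sq = c**2 * np.sum(x**2, axis=-1)    # |grad log f(x)|^2
    return 2.0 * laplacian_log_f + grad_sq + d

def true_risk(theta, tau2):
    """Exact quadratic risk E|T - theta|^2 of the posterior mean T = (1 - c) x."""
    d = theta.shape[-1]
    c = 1.0 / (1.0 + tau2)
    return (1.0 - c)**2 * d + c**2 * np.sum(theta**2)

rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0, 0.5])            # arbitrary illustrative parameter
tau2 = 1.0
x = theta + rng.standard_normal((200_000, theta.size))
sure_mean = sure_linear_shrinkage(x, tau2).mean()
print(sure_mean, true_risk(theta, tau2))
```

The Monte Carlo average of the risk estimate should agree with the exact risk up to simulation error; the comparison says nothing about priors other than the assumed Gaussian one.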

The striking similarity between SURE and the Hyvärinen score [5],

$$H(p, x) = 2\,\Delta \log p(x) + |\nabla \log p(x)|^2, \qquad (2)$$

has been noted before. In eq. (2), $p$ denotes a sufficiently smooth, strictly positive
probability density on $\mathbb{R}^d$. Originally, the Hyvärinen score was introduced for score
matching, a minimum distance type estimation method. Its formal similarity to SURE is
substantiated on reexpressing the risk of $T$ as a distance between densities. Consider the
Hyvärinen divergence defined for smooth, positive densities $p, q$ on $\mathbb{R}^d$ as

$$d_H(p, q) = \int |\nabla \log p(y) - \nabla \log q(y)|^2\, q(y)\, dy. \qquad (3)$$

If $p = f$ is a mixture density as above and $q = p_\theta$ is the density of $P_\theta$, we have
$\nabla \log f(x) - \nabla \log p_\theta(x) = \nabla \log f(x) + x - \theta = T - \theta$, where again $T = x + \nabla \log f(x)$ is
the corresponding posterior mean. Consequently,

$$R(T, \theta) = E_\theta |T - \theta|^2 = \int |\nabla \log f(x) - \nabla \log p_\theta(x)|^2\, p_\theta(x)\, dx = d_H(f, p_\theta), \qquad (4)$$

that is, the risk $R(T, \theta)$ of the parameter estimate $T = x + \nabla \log f(x)$ equals a distance
between densities, $d_H(f, p_\theta)$. Furthermore, the analogue of SURE in the density scenario
is, essentially, the Hyvärinen score $H(f, x)$. In fact, Hyvärinen's idea, reinventing Stein's,
was to apply partial integration to (3) which, assuming boundary terms vanish, gives

$$d_H(p, q) = \int \big(2\,\Delta \log p(y) + |\nabla \log p(y)|^2\big)\, q(y)\, dy + \int |\nabla \log q(y)|^2\, q(y)\, dy, \qquad (5)$$

cf. [1], [5]. Since $\int |\nabla \log p_\theta(x)|^2\, p_\theta(x)\, dx = d$ $(\theta \in \mathbb{R}^d)$ in the standard normal case, where
$q = p_\theta$, it follows that

$$E_\theta \big(H(f, x) + d\big) = E_\theta \big(2\,\Delta \log f(x) + |\nabla \log f(x)|^2 + d\big) = d_H(f, p_\theta). \qquad (6)$$

That is, the modified Hyvärinen score $H(f, x) + d$ represents an unbiased estimate of the
distance $d_H(f, p_\theta)$ of $f$ from the unknown "true" density $p_\theta$, for any density $f > 0$ on
$\mathbb{R}^d$ satisfying suitable regularity conditions.
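For readers who want the intermediate step, the passage from (3) to (5) can be spelled out as follows. This is only a sketch, assuming $p$ and $q$ are smooth and decay fast enough for the boundary term of the partial integration to vanish:

$$\begin{aligned}
|\nabla \log p - \nabla \log q|^2 &= |\nabla \log p|^2 - 2\,\langle \nabla \log p, \nabla \log q \rangle + |\nabla \log q|^2, \\
\int \langle \nabla \log p, \nabla \log q \rangle\, q\, dy &= \int \langle \nabla \log p, \nabla q \rangle\, dy = -\int (\Delta \log p)\, q\, dy .
\end{aligned}$$

Integrating the first line against $q$ and substituting the second line for the cross term yields the two integrals in (5).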

The purpose of this note is to expand on this aspect of unbiased risk estimation by 
tying it to scoring rules. Local proper scoring rules are constructed as gradients of concave 
functionals [3], [4], and then shown to generalize SURE in that they furnish unbiased 
estimates of modified Bregman type distances. The development is related to (parts of) 
work by Dawid and Lauritzen [1]. See also [2], [7].



2 Local proper scoring rules and unbiased risk estimation 

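We restrict the discussion of scoring rules to the setting relevant for this note, and refer to
[3] for general information. Let $\mathcal{P}$ denote the class of all probability densities with respect
to the Lebesgue measure on $\mathbb{R}^d$ such that the following conditions hold for every $p \in \mathcal{P}$:
(P1) $p$ is sufficiently smooth; (P2) $p > 0$ everywhere on $\mathbb{R}^d$; (P3) for every $m > 0$ and $i, j \in \{1, \ldots, d\}$,

$$\lim_{|x| \to \infty} |x|^m \big( p(x) + |\partial_i p(x)| + |\partial_{ij} p(x)| \big) = 0;$$

(P4) there exists $a = a(p) > 0$ such that for $i, j \in \{1, \ldots, d\}$,

$$\lim_{|x| \to \infty} |x|^{-a} \left( |\log p(x)| + \frac{|\partial_i p(x)|}{p(x)} + \frac{|\partial_{ij} p(x)|}{p(x)} \right) = 0.$$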



The class $\mathcal{P}$ is quite large, being convex and comprising, e.g., all normal and logistic
distributions.

A scoring rule is a mapping $S : \mathcal{P} \times \mathbb{R}^d \to \mathbb{R}$ assigning a numerical score, $S(p, x)$, to
the density forecast, $p$, when the observation that materializes is $x$. We write $S(p, q) =
\int S(p, x)\, q(x)\, dx = E_q S(p, \cdot)$ for the expected score when the density forecast is $p$ and
the probability measure underlying $x$ is $q(x)\,dx$. The scoring rule $S$ is (strictly) proper
relative to $\mathcal{P}$ if $S(q, q) \le S(p, q)$ for all $p, q \in \mathcal{P}$ (with equality only if $p = q$). The
scoring rule $S$ is local (of order two, for the class $\mathcal{P}$) if there exists a real function $s$ such
that

$$S(p, x) = s\big(x, \log p(x), \nabla \log p(x), \nabla^2 \log p(x)\big) \qquad (p \in \mathcal{P},\ x \in \mathbb{R}^d),$$

$\nabla^2 f(x)$ denoting the Hessian matrix of second-order partial derivatives of a function
$f : \mathbb{R}^d \to \mathbb{R}$ at $x$.

The classical example of a (strictly) proper local scoring rule is the logarithmic score,
$S(p, x) = -\log p(x)$. Another example is the Hyvärinen score (2). The latter can be
regarded as being local of order two, in the obvious sense, and the former as local of order
zero. Local scoring rules of arbitrary order were recently investigated in [7], in the case
$d = 1$. Hereafter, "local" always is understood as "local of order two."
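Propriety of the two example scores can be illustrated numerically in dimension one. The sketch below assumes normal forecasts $p = \mathcal{N}(\mu, \sigma^2)$ and a standard normal truth $q$ (both illustrative choices), takes the Hyvärinen score in the form $2(\log p)'' + ((\log p)')^2$, and approximates the expected score $S(p, q)$ by a Riemann sum on a fixed grid. All function names are hypothetical.

```python
import numpy as np

GRID = np.linspace(-12.0, 12.0, 24_001)
DX = GRID[1] - GRID[0]

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def log_score(x, mu, sigma):
    """Logarithmic score -log p(x) of the N(mu, sigma^2) forecast."""
    return 0.5 * ((x - mu) / sigma) ** 2 + np.log(sigma * np.sqrt(2.0 * np.pi))

def hyvarinen_score(x, mu, sigma):
    """2 (log p)'' + ((log p)')^2 for the N(mu, sigma^2) forecast."""
    return -2.0 / sigma**2 + ((x - mu) / sigma**2) ** 2

def expected_score(score, mu, sigma):
    """S(p, q) = integral of S(p, x) q(x) dx with q = N(0, 1), by a Riemann sum."""
    return np.sum(score(GRID, mu, sigma) * normal_pdf(GRID, 0.0, 1.0)) * DX

# First entry is the true density; the others are misspecified forecasts.
forecasts = [(0.0, 1.0), (0.5, 1.0), (0.0, 2.0), (-1.0, 0.7)]
for mu, sigma in forecasts:
    print(mu, sigma,
          expected_score(log_score, mu, sigma),
          expected_score(hyvarinen_score, mu, sigma))
```

For both scores the expected value should be smallest at the true density, in line with the convention $S(q, q) \le S(p, q)$ that smaller scores are better.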

The following result lifts the construction of local proper scoring rules in [2] from
the one- to the higher-dimensional case $d > 1$. Let $\mathcal{K}$ denote the class of the kernels
$k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ satisfying the following conditions: (K1) $k$ is sufficiently smooth; (K2) there are
constants $C, r \in (0, \infty)$ such that whenever $k^*$ stands for the function $k = k(x, y)$ or any
of its partial derivatives up to order two, then $|k^*(x, y)| \le C\,(1 + |x| + |y|)^r$ $(x, y \in \mathbb{R}^d)$.
With any $k \in \mathcal{K}$ we associate a functional $\Phi = \Phi_k : \mathcal{P} \to \mathbb{R}$ defined by

$$\Phi(p) = \int k\big(x, \nabla \log p(x)\big)\, p(x)\, dx \qquad (p \in \mathcal{P}). \qquad (7)$$

In view of the growth and decay conditions (K2), (P4), and (P3), the integral in (7)
exists and is finite for every $p \in \mathcal{P}$. Let $\nabla_y k$ denote the partial gradient referring to the
argument $y \in \mathbb{R}^d$ of $k = k(x, y)$, and recall that $\mathrm{div}\, g(x)$ stands for the trace of the total
derivative at $x$ of a function $x \mapsto g(x)$ mapping $\mathbb{R}^d$ into itself.

Theorem 2.1 Let $k \in \mathcal{K}$, and suppose that the associated functional $\Phi$ is concave on $\mathcal{P}$.
Then

$$S(p, x) = k\big(x, \nabla \log p(x)\big) - \frac{1}{p(x)}\, \mathrm{div}\Big( p(x)\, \nabla_y k\big(x, \nabla \log p(x)\big) \Big) \qquad (8)$$

is a local proper scoring rule relative to $\mathcal{P}$. It is strictly proper if $\Phi$ is strictly concave.
Furthermore, if $y \mapsto k(x, y)$ is concave on $\mathbb{R}^d$ for every $x \in \mathbb{R}^d$, then the functional $\Phi$ is
concave on $\mathcal{P}$.

The proof follows similar lines as in the case $d = 1$, see [2, Sections 4.1, 5.1]. We only
indicate that the tangent construction in [2, Section 4.1] yields the scoring rule (8). To
compute the (weak) gradient of $\Phi$ at $q \in \mathcal{P}$, let $p_t = (1 - t)q + tp$ where $p \in \mathcal{P}$, $t \in [0, 1]$.
Formal differentiation ignoring all technicalities gives



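$$\frac{d}{dt}\, \Phi(p_t) = \int \frac{\partial}{\partial t} \big[ K_{p_t}\, p_t \big]\, dx = \int K_{p_t}\, (p - q)\, dx + \int p_t\, \Big\langle \nabla_y k(\cdot, \nabla \log p_t),\ \nabla \frac{p - q}{p_t} \Big\rangle\, dx, \qquad (9)$$

wherein we put $K_{p_t}(x) = k(x, \nabla \log p_t(x))$ and omitted the argument $x$ of the integrands.
For the last integral in (9) we get by the divergence theorem, assuming the boundary
integral vanishes,

$$\int p_t\, \Big\langle \nabla_y k(\cdot, \nabla \log p_t),\ \nabla \frac{p - q}{p_t} \Big\rangle\, dx = -\int \frac{1}{p_t}\, \mathrm{div}\big( p_t\, \nabla_y k(\cdot, \nabla \log p_t) \big)\, (p - q)\, dx. \qquad (10)$$

Setting $t = 0$ in (9) and (10) and noting that $p_0 = q$ we find that

$$\frac{d}{dt}\, \Phi(p_t) \Big|_{t=0} = \int \Big\{ k(\cdot, \nabla \log q) - \frac{1}{q}\, \mathrm{div}\big( q\, \nabla_y k(\cdot, \nabla \log q) \big) \Big\}\, (p - q)\, dx. \qquad (11)$$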



Thus, the gradient of $\Phi$ at $q$ is given by the expression in curly brackets in (11). The
scoring rule resulting from the tangent construction, $S(q, \cdot)$, differs from this gradient only by
a correction term which can be shown to vanish. The negligibility of the boundary integral
in (10), and all the technicalities (existence of integrals, exchangeability of differentiation
and integration, etc.) can be settled similarly as in [2, Section 4.1], using the assumptions
made about the classes $\mathcal{P}$ and $\mathcal{K}$. □
Any convex combination of a scoring rule $S$ as in Theorem 2.1 with the logarithmic
score yields a local proper scoring rule. In the case $d = 1$, scoring rules of this form
exhaust the class of all local proper scoring rules [2], [7]. The complete characterization
in the case $d > 1$ remains open.

Examples. Let $k \in \mathcal{K}$ be a kernel of the form $k(x, y) = k(y) = \psi(|y|)$, where $\psi$ is a
concave function on $[0, \infty)$ with $\psi(0) = \psi'(0) = 0$. Then $y \mapsto k(y)$ is concave on $\mathbb{R}^d$,
and the corresponding scoring rule (8) is proper. Explicitly we have

$$S(p, \cdot) = \psi(|a|) - \frac{1}{p}\, \mathrm{div}\Big( p\, \psi'(|a|)\, \frac{a}{|a|} \Big),$$

where $a = \nabla \log p$. For $\psi(t) = -t^2$ we obtain the Hyvärinen score (2); putting $\psi(t) =
-\log \cosh t$ yields another interesting example parallel to [2, Example 5.3].
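The first example can be verified by direct substitution. The following is a sketch, assuming the scoring rule has the form $S(p, \cdot) = k(\cdot, a) - p^{-1}\, \mathrm{div}\big(p\, \nabla_y k(\cdot, a)\big)$ with $a = \nabla \log p$, and using $p\,a = \nabla p$ together with $\Delta p / p = \Delta \log p + |\nabla \log p|^2$:

$$\begin{aligned}
k(y) &= -|y|^2, \qquad \nabla_y k(y) = -2y, \\
S(p, \cdot) &= -|a|^2 + \frac{2}{p}\, \mathrm{div}(p\, a) = -|a|^2 + \frac{2\, \Delta p}{p} \\
&= -|a|^2 + 2\big( \Delta \log p + |a|^2 \big) = 2\, \Delta \log p + |\nabla \log p|^2 .
\end{aligned}$$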

A local scoring rule $S$ that is proper relative to $\mathcal{P}$ gives rise to a Bregman type
divergence measure $d_S(p, q) = S(p, q) - S(q, q)$ on $\mathcal{P} \times \mathcal{P}$. The following representation of
$d_S$ is closely related to [7, Eq. (53)].

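Theorem 2.2 Suppose that $S$ is of the form (8) for some kernel $k \in \mathcal{K}$ such that
$y \mapsto k(x, y)$ is concave on $\mathbb{R}^d$ for every $x \in \mathbb{R}^d$. Then the divergence $d_S$ admits the
representation

$$d_S(p, q) = \int \Big\{ k(x, \nabla \log p) - k(x, \nabla \log q) - \big\langle \nabla \log p - \nabla \log q,\ \nabla_y k(x, \nabla \log p) \big\rangle \Big\}\, q(x)\, dx. \qquad (12)$$

Proof. Let $p, q \in \mathcal{P}$. By the assumptions on $\mathcal{P}$ and $\mathcal{K}$, the divergence theorem applied to
the scalar function $u(x) = q(x)/p(x)$ and the vector field $v(x) = p(x)\, \nabla_y k(x, \nabla \log p(x))$
gives

$$\int \frac{q}{p}\, \mathrm{div}\big( p\, \nabla_y k(\cdot, \nabla \log p) \big)\, dx = -\int \big\langle \nabla \log q - \nabla \log p,\ \nabla_y k(\cdot, \nabla \log p) \big\rangle\, q\, dx. \qquad (13)$$

The relation (12) follows on writing $d_S(p, q) = E_q \{ S(p, \cdot) - S(q, \cdot) \}$, substituting (8) and
using (13), and observing that $\int q^{-1}\, \mathrm{div}\big( q\, \nabla_y k(\cdot, \nabla \log q) \big)\, q\, dx = 0$. □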

Note that the expression in curly brackets in (12) is nonnegative because for a concave
function $f$ on $\mathbb{R}^d$ one has $f(y_1) - f(y_2) \ge \langle y_1 - y_2, \nabla f(y_1) \rangle$ $(y_1, y_2 \in \mathbb{R}^d)$. For the
Hyvärinen score, where $k(x, y) = -|y|^2$, that expression becomes $|\nabla \log p - \nabla \log q|^2$, and
$d_S$ becomes the Hyvärinen divergence (3).
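The divergence representation can be checked numerically in dimension one. The sketch below rests on stated assumptions: the kernel $k(y) = -\log\cosh y$; normal densities $p$ and $q$; the scoring rule in the form $S(p, x) = k(a) - (p\,k'(a))'/p$ with $a = (\log p)'$; and the representation $d_S(p, q) = \int \{k(a) - k(b) - (a - b)\,k'(a)\}\, q\, dx$ with $b = (\log q)'$. Integrals are approximated by Riemann sums on a fixed grid, and the function names are hypothetical.

```python
import numpy as np

x = np.linspace(-15.0, 15.0, 30_001)
dx = x[1] - x[0]

def k(y):                     # assumed kernel: k(y) = -log cosh(y)
    return -np.log(np.cosh(y))

def dk(y):                    # first derivative of k
    return -np.tanh(y)

def d2k(y):                   # second derivative of k
    return -1.0 / np.cosh(y) ** 2

def normal_pdf(mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def score(mu, sigma):
    """S(p, x) = k(a) - (p k'(a))'/p = k(a) - a k'(a) - k''(a) a' for p = N(mu, sigma^2)."""
    a = -(x - mu) / sigma**2          # a = (log p)'
    da = -1.0 / sigma**2              # a' = (log p)''
    return k(a) - a * dk(a) - d2k(a) * da

def divergence_by_scores(p, q):
    """E_q{S(p, .) - S(q, .)} computed directly from the scores."""
    return np.sum((score(*p) - score(*q)) * normal_pdf(*q)) * dx

def divergence_by_representation(p, q):
    """Integral of {k(a) - k(b) - (a - b) k'(a)} q dx."""
    a = -(x - p[0]) / p[1] ** 2
    b = -(x - q[0]) / q[1] ** 2
    return np.sum((k(a) - k(b) - (a - b) * dk(a)) * normal_pdf(*q)) * dx

p, q = (0.5, 1.2), (0.0, 1.0)         # (mean, standard deviation) pairs
print(divergence_by_scores(p, q), divergence_by_representation(p, q))
```

The two numbers should coincide up to quadrature error and be nonnegative, which is the content of the representation for this particular kernel and pair of densities.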

To clarify the connection with SURE we note that the partial integration in (13) was
used conversely by Stein and Hyvärinen, to pass from the risk representation (12) to an
expression of the form $E_q \{ S(p, \cdot) - S(q, \cdot) \}$. In the latter, the scoring rule $S(p, \cdot)$ may
serve as an unbiased estimate of $E_q S(p, \cdot)$, while the term $E_q S(q, \cdot)$ is the same for all
candidates $p$, hence can be ignored if the focus is on risk comparison. In nonparametric
density estimation, e.g., risk comparison of competing estimates is applied for bandwidth
selection, using cross-validation. Briefly, if $\hat p_n = \hat p_n(\cdot \mid x_1, \ldots, x_n)$ is an estimate of the
unknown density $q \in \mathcal{P}$ underlying the i.i.d. observations $x_1, \ldots, x_n$ that is symmetric
in the $x_i$, the cross-validated expression $\hat R_n(\hat p_{n-1}) = \frac{1}{n} \sum_{i=1}^n S(\hat p_{n-1}^{(i)}, x_i)$, with $\hat p_{n-1}^{(i)}$
computed from the sample without $x_i$, is an unbiased estimate of $R_{n-1}(\hat p_{n-1}, q)$, where
$R_n(\hat p_n, q) = E_q S(\hat p_n, q)$ denotes the modified risk
ignoring the term $E_q S(q, \cdot) = S(q, q)$, which depends only on $q$.
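The cross-validation idea can be sketched for bandwidth selection in dimension one. The code below is an illustration under assumptions not made in the text: a Gaussian kernel density estimate, the Hyvärinen score written as $2p''/p - (p'/p)^2$, and a hypothetical bandwidth grid. Whether such an estimate meets the regularity conditions of the class of densities considered here is not checked. The function name `loo_hyvarinen_score` is hypothetical.

```python
import numpy as np

def loo_hyvarinen_score(data, h):
    """Average leave-one-out Hyvarinen score of a Gaussian kernel density estimate."""
    data = np.asarray(data, dtype=float)
    diff = data[:, None] - data[None, :]            # diff[i, j] = x_i - x_j
    log_w = -0.5 * (diff / h) ** 2
    np.fill_diagonal(log_w, -np.inf)                # leave x_i out of its own estimate
    log_w -= log_w.max(axis=1, keepdims=True)       # stabilise before exponentiating
    w = np.exp(log_w)
    w /= w.sum(axis=1, keepdims=True)               # kernel normalising constants cancel
    d1 = np.sum(w * (-diff / h**2), axis=1)                   # p'(x_i) / p(x_i)
    d2 = np.sum(w * (diff**2 / h**4 - 1.0 / h**2), axis=1)    # p''(x_i) / p(x_i)
    # 2 (log p)'' + ((log p)')^2 = 2 p''/p - (p'/p)^2
    return np.mean(2.0 * d2 - d1**2)

rng = np.random.default_rng(1)
sample = rng.standard_normal(300)                   # illustrative data
bandwidths = np.array([0.1, 0.2, 0.4, 0.8, 1.6])    # hypothetical candidate grid
cv = np.array([loo_hyvarinen_score(sample, h) for h in bandwidths])
print(bandwidths[np.argmin(cv)])
```

Since smaller scores are better under the convention used here, the candidate with the smallest cross-validated score is selected. Only ratios of kernel sums enter the computation, so no normalizing constant is needed.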

The possibility of risk estimation is of course not confined to the local scoring rules
considered here, as any proper scoring rule $S$, whether local or not, gives rise to a divergence
measure $d_S$. Therefore, cross-validatory estimation of the (modified) risk generally is
feasible, although exact unbiasedness as with the local scoring rules may not be achievable
when global terms are involved. For example, unbiased estimation of the term $\int p(x)^2\, dx$
entering the quadratic score [3] does not seem possible.

The particular interest of the scoring rules of the form (8) ensues from the fact that
they do not require the knowledge of the normalizing constants of the probability densities, 
which may be unknown or hard to obtain in complex settings [5], [7]. This advantage can 
be combined with other desirable features such as improved robustness by working, for 
instance, with the log cosh scoring rule mentioned above. 
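The independence from normalizing constants is easy to see in code, because the score depends on the density only through derivatives of $\log p$. The following sketch evaluates the Hyvärinen score $2\,\Delta \log p + |\nabla \log p|^2$ by central finite differences (a numerical shortcut chosen for illustration) for a hypothetical unnormalized Gaussian-type model and for the same model with an arbitrary constant added to the log density.

```python
import numpy as np

def hyvarinen_score_fd(log_density, x, h=1e-2):
    """2*Laplacian(log p) + |grad log p|^2 at x, by central finite differences."""
    x = np.asarray(x, dtype=float)
    f0 = log_density(x)
    grad_sq, laplacian = 0.0, 0.0
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        fp, fm = log_density(x + e), log_density(x - e)
        grad_sq += ((fp - fm) / (2.0 * h)) ** 2       # squared i-th partial derivative
        laplacian += (fp - 2.0 * f0 + fm) / h**2      # i-th second partial derivative
    return 2.0 * laplacian + grad_sq

def log_unnormalised(x):
    """Hypothetical model known up to a constant: log p(x) = -|x|^2 / 2 + const."""
    return -0.5 * np.sum(x**2)

def log_shifted(x):
    """Same model with a different, arbitrary additive constant."""
    return log_unnormalised(x) + 123.456

x0 = np.array([0.3, -1.2, 2.0])
print(hyvarinen_score_fd(log_unnormalised, x0), hyvarinen_score_fd(log_shifted, x0))
```

Both calls should return the same value up to rounding, namely $-2d + |x_0|^2$ for this model, since the additive constant drops out of every difference quotient.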